[develop] Fix Jenkins Nightly Build #1161

Merged

Conversation

EdwardSnyder-NOAA
Collaborator

@EdwardSnyder-NOAA EdwardSnyder-NOAA commented Nov 27, 2024

DESCRIPTION OF CHANGES:

The Jenkins nightly builds have been inconsistent or not working at all on the Parallel Works (PW) platforms. Some of the issues have been related to the instance infrastructure on Azure or to a conda conflict between the host machine and the conda built by the SRW App (originally seen on GCP, the conflict is now being observed on all PW platforms). This PR resolves the conda conflict by deactivating the host conda before activating the srw_app environment on all PW platforms. The solution for Azure requires configuration changes, which were made on the backend, but it can lead to increased runtimes for some tasks, since the file system and controller node reside in Zone 1 while the compute node can be launched in any zone. Most tasks run fine, with the exception of the run_post task, which is addressed by increasing the wall time for the skill score/vx test we use for the nightly build. Lastly, this PR removes the old, unneeded Azure spack-stack logic, since we updated to a newer spack-stack build.

EDIT: removed the wall time increase for the run_post task, since the task passed with Jenkins on Azure without it.
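For readers unfamiliar with the conda conflict described above, the following is a minimal sketch of the "deactivate the host conda, then activate srw_app" idea. It is not the code from this PR. The function name `activate_clean` is hypothetical, and the sketch assumes the `conda` shell function has already been initialized in the current shell and that conda maintains the `CONDA_SHLVL` counter of stacked environments.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from the PR): unwind any conda
# environments inherited from the host, then activate the target one.
activate_clean() {
  local target_env="$1"

  # CONDA_SHLVL counts how many environments are currently stacked;
  # treat an unset value as 0 (no environment active).
  while [[ "${CONDA_SHLVL:-0}" -gt 0 ]]; do
    # Stop unwinding if deactivation fails, to avoid looping forever.
    conda deactivate || break
  done

  conda activate "${target_env}"
}

# Example call (assumes conda is initialized and srw_app exists):
#   activate_clean srw_app
```

Unwinding in a loop rather than calling `conda deactivate` once covers the case where the host has more than one environment stacked, for example `base` plus a site-specific environment.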

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • derecho.intel
  • gaea.intel
  • hera.gnu
  • hera.intel
  • hercules.intel
  • jet.intel
  • orion.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
    • Azure: PR nightly build test results
    • AWS: PR nightly build test results
    • GCP: PR nightly build test results
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DEPENDENCIES:

DOCUMENTATION:

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

CONTRIBUTORS (optional):

Collaborator

@MichaelLueken MichaelLueken left a comment

@EdwardSnyder-NOAA -

These changes look good to me!

I was also able to successfully run the skill-score on Orion without issue:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202411  COMPLETE              58.70
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE              58.70
+ SS_INDEX=0.99963
+ echo 'Skill Score: 0.99963'
Skill Score: 0.99963
+ [[ 0.99963 < 0.700 ]]
+ echo 'Congrats! You pass check!'
Congrats! You pass check!

Approving PR now.
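A side note on the traced check `[[ 0.99963 < 0.700 ]]` above: inside `[[ ]]`, bash's `<` operator compares strings lexicographically rather than numerically. That gives the expected result for these two values, but it would misorder values such as `10.0` and `9.5`. The following is a minimal sketch of a numeric comparison, assuming `awk` is available. The function name `skill_score_passes` is hypothetical and does not come from the SRW App scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return success (0) when the score is greater than
# or equal to the threshold, comparing the two values as numbers.
skill_score_passes() {
  local score="$1" threshold="$2"
  # Adding 0 forces awk to treat both values as numbers; the exit status
  # is 0 when the comparison is true and 1 otherwise.
  awk -v s="${score}" -v t="${threshold}" 'BEGIN { exit !(s + 0 >= t + 0) }'
}

# Example call:
#   if skill_score_passes "${SS_INDEX}" 0.700; then echo 'pass'; fi
```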

@MichaelLueken MichaelLueken added the bug Something isn't working label Nov 27, 2024
@EdwardSnyder-NOAA
Collaborator Author

Ran the fundamental test suite on AWS, and it passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE             156.85
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20241  COMPLETE              51.87
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE             120.77
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024112  COMPLETE             306.71
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20241127171  COMPLETE             130.93
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024112717135  COMPLETE             153.28
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             920.41

Detailed summary written to /contrib/Edward.Snyder/pw-vx-fix/expt_dirs/WE2E_summary_20241127192207.txt

This PR was tested with the Jenkins nightly build on all cloud platforms, and all of the builds passed. See the result files in the Tests Conducted section.

Collaborator

@rickgrubin-noaa rickgrubin-noaa left a comment

Approved based on test results and summary provided.

MichaelLueken and others added 2 commits December 2, 2024 09:53
 to the linkcheck_ignore list to get around 403 client error
[fix-pw-vx-issue] Add URL to linkcheck_ignore list to correct 403 client error
@MichaelLueken
Collaborator

The issue with the Doc Tests has been corrected. Since the Jenkins pipeline doesn't run on NOAA Cloud, I will now move forward with merging this work.

@MichaelLueken MichaelLueken merged commit 4fbdf7f into ufs-community:develop Dec 2, 2024
3 checks passed